ABSTRACT

Part-of-speech (POS) tagging, which assigns the correct tag to each word from a number of available tags, is a fundamental task in NLP applications such as grammar checking, speech processing, and machine translation. The accuracy of a tagger remains the biggest challenge today. Many taggers have been proposed by different researchers for different languages (Telugu, Tamil, Kannada, Punjabi, Hindi, Bengali, etc.) using techniques such as HMM (Hidden Markov Model), SVM (Support Vector Machine), and ME (Maximum Entropy). A Telugu POS tagger based on HMM is one of them. It uses the Hidden Markov Model, a statistical technique, to tag words in Telugu using the tag set of 630 tags developed by Rama Sree, R.J. and Kusuma Kumari (2007). This large tag set (630 tags) results in a data sparseness problem. To cope with this problem, this paper reports an experiment in which the reduced POS tag set (36 tags) proposed by Technical Development of Indian Languages (TDIL) is used to improve the tagging accuracy of the HMM based POS tagger. Finally, the results were manually evaluated by a linguist.
INTRODUCTION:

Telugu belongs to the Dravidian family of languages. Other members of this
family include Kannada, Tamil, and Malayalam; Hindi, Bengali, Gujarati, and
Marathi, by contrast, belong to the Indo-Aryan family. Telugu is spoken in India,
Canada, USA, UK, and other countries with Telugu immigrants. Telugu is
reported to be the 8th most widely spoken language in the world, the 4th most
spoken language in Canada (The Times of India, 14th February, 2008) and the 9th in
India with more than 45 million speakers. It is the official language of the Telugu
states (Andhra Pradesh and Telangana).


The first treatise on Telugu grammar, the "Andhra Shabda Chintamani", was
written in Sanskrit in the 11th century A.D. by Nannayya, who is considered the
first poet and translator of Telugu. There was no grammatical work in Telugu prior
to Nannayya's "Andhra Shabda Chintamani". This grammar followed the patterns
which existed in grammatical treatises like Ashtadhyayi and Valmiki Vyakaranam,
but unlike Panini, Nannayya divided his work into five chapters, covering
Samjna, Sandhi, Ajanta, Halanta and Kriya. After Nannayya, Atharvana and
Ahobala composed Sutras, Vartika and Bhashyam. Like Nannayya, they
wrote their works in Sanskrit.


In the 19th century, Chinnaya Suri wrote a simplified work on Telugu grammar
called Balavyakaranam, borrowing concepts and ideas from Nannayya's Andhra
Shabda Chintamani, and wrote his literary work in Telugu. Every Telugu
grammatical rule is derived from the concepts of Panini, Katyayana, and Patanjali.
However, a high percentage of the aspects and techniques borrowed in Telugu are
Paninian.


According to Nannayya, language without 'Niyama', or language which does not
adhere to Vyakaranam, is called Gramya or Apabhramsa and hence is unfit for
literary usage. All the literary texts in Telugu follow Vyakaranam. Compared to
languages like English, Telugu is a morphologically rich language and has
relatively free word order. It follows a Subject-Object-Verb (S-O-V) pattern. For
example:


Transliteration   Baludu    pataselaku           bayalderadu
Gloss             Boy       towards the school   moved
Parts             subject   object               verb
Translation       Boy moved towards the school


This sentence can also be interpreted as 'Boy moved towards the school'
depending on the context, but this does not affect the SOV order.


In most natural language processing applications, such as grammar checking,
sentence identification, and phrase chunking, the computer requires only
grammatical information about the input text. This grammatical information is
given in the form of tags called part-of-speech tags. The process of assigning the
correct tag to a word from a number of available tags is called POS tagging. Here
the grammatical information of the word is called a tag. It is well known that a
computer will be able to understand and process a language if the meaning of each
and every word of that language is known. The parts of speech are the different
word classes to which a word can belong, such as noun, adjective, and verb. A word
can occur in more than one word class in different contexts: the same word can act
as a noun in one sentence and as a verb in another.


So, in order to assign the exact grammatical information to a word, one must know
the context in which that word has occurred. For a computer system it is very
difficult to understand the context of a sentence. Therefore, different techniques
are used to assign the part of speech tag to a word.


RELATED WORK:

Different Techniques used for POS Tagging of Indian Languages:

There are basically three techniques used for part of speech tagging: 1) rule
based methods, 2) statistical methods, and 3) neural network based methods.
Besides these three, a hybrid method is also used, which is a combination of two
or three of the above-mentioned techniques. In the rule based technique,
hand-written rules are used for disambiguation of tags. These rules are
developed manually, so thorough knowledge of the language is required to
develop them. The rule based technique was used by Sreeganesh (2006)
for Telugu; another rule based POS tagger was developed for Punjabi
by Mandeep Singh Gill and Gurpreet Singh Lehal (2008). The statistical
method is another technique commonly used for part of speech tagging. A
commonly used statistical method is the support vector machine (SVM), used
by Ekbal and S. Bandyopadhyay (2008) for Bengali; V. Dhanalakshmi et
al. (2008) for Tamil; M. Anandkumar, Vijaya M.S., Loganathan R.,
Soman K.P., and Rajendran S. (2008); Sindhiya Binulal et al. for Telugu;
and Antony P.J. et al. for Malayalam. Hidden Markov model based
techniques were used by Manish Shrivastava and Pushpak Bhattacharyya for a
Hindi POS tagger; Manju K. et al. for Malayalam; Navanath Saharia
et al. for Assamese; Sanjeev Kumar Sharma et al. (2011) for Punjabi; and
Ekbal, S. Mondal et al. for Bengali. The maximum entropy based technique
was used by Aniket Dalal et al. for Hindi and by Ekbal et al. (2008) for
Bengali.


The Conditional Random Field (CRF) based technique has been used by Ravindran
et al. and Himanshu et al. for POS tagging and chunking of Hindi; other
Indian languages to which the CRF technique has been applied are Bengali and
Manipuri. A neural network based technique has been used by Ankur Parikh for
Hindi. As a hybrid approach, a combination of rule based and
HMM based techniques has been used by Arulmozhi P. et al. for the development
of a Tamil POS tagger; Chirag Patel and Karthik Gali used a combination of a
rule based method and CRF for a Gujarati POS tagger.


Existing POS Tagger of Telugu Language:

Telugu is one of the most widely spoken languages of the Dravidian family, whose
other members include Kannada, Tamil, and Malayalam. Two POS tagging systems
using two different techniques were developed for Telugu by RamaSree, R. J., and
P. Kusuma Kumari in 2007. The first system was rule based and was developed as a
sub-part of a grammar checker project. Its rules were implemented using regular
expressions. The main reason for using this rule based technique was that the
rules can be edited, i.e. new rules can be added and existing ones deleted.


A tag set of more than 630 tags was also developed. The second POS tagging system
was developed using a statistical method: a Hidden Markov model was used
to disambiguate the tags, and the Viterbi algorithm was used to implement it.
The tag set used in the second system was the same as that proposed by Mandeep
Singh et al. They also tried a hybrid approach, a combination of the rule based
system and the statistical approach, in which the output of the rule based system
was fed to the statistical system. This gave a further improvement in the
accuracy of the POS tagger.


Tag set: 

A tag set is the set of all tags used to represent the grammatical information of
a language. The number of tags used for a language depends upon the length of
the tag, which in turn depends upon the amount of information that we want to
represent using a tag. For example, if just the basic word class is to be
represented with each word, then the length of the tag will be 2, 3 or 4
characters. One extra character is required for each extra piece of grammatical
information that is to be represented in the tag. For example, to represent
only the word class we can use the tag NN for a noun. But if we also want to
represent gender information, then an extra character is added to this tag. This
extra character may be M for masculine gender, F for feminine gender and B for
both. The tag for a masculine noun will therefore be NNM, for a feminine noun NNF,
and for a noun of both categories NNB. This extra information not only
increases the length of the tag but also increases the number of tags. In the
above case, a single tag was sufficient when only the word class was represented,
and adding gender information increased the number of tags to 3. From the above
discussion it is concluded that the amount of information has a direct effect on
the number of tags.
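
The growth in the number of tags described above can be illustrated with a small sketch. It assumes, purely for illustration, that every extra grammatical feature appends one character to the base tag; the `expand_tags` helper and the number feature (S/P) are hypothetical and are not part of any tag set discussed in this paper.

```python
from itertools import product

def expand_tags(base_tags, features):
    """Return every tag formed by appending one value per feature to each base tag."""
    tags = []
    for base in base_tags:
        # product() of zero feature lists yields one empty combination,
        # so a base tag without extra features is returned unchanged.
        for combo in product(*features):
            tags.append(base + "".join(combo))
    return tags

gender = ["M", "F", "B"]  # masculine, feminine, both (as in the text)
number = ["S", "P"]       # hypothetical second feature, for illustration only

print(expand_tags(["NN"], []))                     # ['NN']
print(expand_tags(["NN"], [gender]))               # ['NNM', 'NNF', 'NNB']
print(len(expand_tags(["NN"], [gender, number])))  # 6
```

Each added feature multiplies the number of tags, which is how a fine-grained design can reach several hundred tags.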


Telugu tag set: 

Existing Telugu POS Tag set: 

The two POS taggers developed for Telugu were discussed in the section above.
Both of these POS taggers use the same tag set. This tag set was developed
keeping in mind that the POS tagger was to be used in grammar checking software
for Telugu. The tag set was fine grained and used more than 630 tags.


New proposed Tag set by TDIL: 

A number of POS tag sets have been developed by different organizations, each
depending on some general principles of tag set design. For POS annotation
of texts in Telugu, we have used the tag set proposed by TDIL (Technical
Development of Indian Languages), which contains 36 tags for Telugu.


INTRODUCTION TO HMM (Hidden Markov Model): 

The Hidden Markov Model is a statistical model used to solve classification
problems. It was proposed by L. E. Baum. The model assigns a joint
probability to paired observation and label sequences, and its parameters are
trained to maximize the joint likelihood of the training set. In NLP applications
this training is done using an accurately annotated corpus. The main advantage
of this model is that it is easy to understand and implement. The accuracy of
the model generally improves as the size of the training data increases.


Basic Definitions and Notation: 
According to Rabiner (1989), an HMM can be defined using the following
five elements:


1. $N$, the number of distinct states in the model. For part-of-speech
   tagging, $N$ is the total number of tags that can be used by the system. In
   the existing system there are more than 630 tags and in our proposed system
   there are only 36 tags. Each possible tag for the system corresponds to
   one state of the HMM.


2. $M$, the number of distinct output symbols in the alphabet of the HMM.
   For part-of-speech tagging, $M$ is the number of words in the lexicon of
   the system. As the exact number of words of a language cannot be counted,
   the number of distinct words present in the training corpus is taken as $M$.


3. $A = \{a_{ij}\}$, the state transition probability distribution, where
   $a_{ij}$ is the probability that the system will move from state $i$ to
   state $j$ in one transition. For part-of-speech tagging, the states are
   represented by the tags, so $a_{ij}$ is the probability that the model will
   move from tag $t_i$ to tag $t_j$. This probability can be estimated from a
   training corpus.


4. $B = \{b_j(k)\}$, the emission probability distribution. The probability
   $b_j(k)$ is the probability that the $k$-th output symbol will be emitted
   when the model is in state $j$. For part-of-speech tagging, this is the
   probability that the word $W_k$ is emitted by tag $t_j$, i.e.
   $P(W_k \mid t_j)$. This probability can again be estimated from a training
   corpus.


5. $\pi = \{\pi_i\}$, the initial state distribution. It is the probability
   that the model will start in state $i$. For part-of-speech tagging, this is
   the probability that the sentence will begin with tag $t_i$.


When using an HMM to perform part-of-speech tagging, the goal is to determine
the most likely sequence of tags (states) that generates the words in the
sentence (sequence of output symbols). In other words, given a sentence $V$,
calculate the sequence $U$ of tags that maximizes $P(U \mid V)$. The Viterbi
algorithm is a common method for calculating the most likely tag sequence when
using an HMM. The proposed model is a first order HMM, also referred to as
bigram POS tagging. For the POS tagging problem, the Hidden Markov Model is
composed of two probabilities: the lexical (emission) probability and the
contextual (transition) probability (Samuelsson, 1996).
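
The remark that these probabilities can be estimated from a training corpus can be made concrete with a minimal sketch. It assumes plain relative-frequency estimation without smoothing, which the paper does not specify; the `estimate_hmm` function, the second toy sentence, and the tags assigned to the example words are illustrative assumptions rather than the authors' implementation or data.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Estimate pi, A and B by relative frequency from (word, tag) sentences."""
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            emit_counts[tag][word] += 1
            if prev_tag is None:
                start_counts[tag] += 1            # tag opens the sentence
            else:
                trans_counts[prev_tag][tag] += 1  # bigram of adjacent tags
            prev_tag = tag

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalise(start_counts)
    A = {tag: normalise(following) for tag, following in trans_counts.items()}
    B = {tag: normalise(words) for tag, words in emit_counts.items()}
    return pi, A, B

# Toy corpus: the example sentence from the introduction plus one invented
# sentence; the tags are assumed for illustration only.
corpus = [
    [("Baludu", "N_NN"), ("pataselaku", "N_NN"), ("bayalderadu", "V_VM")],
    [("Baludu", "N_NN"), ("bayalderadu", "V_VM")],
]
pi, A, B = estimate_hmm(corpus)
print(pi)         # initial state distribution
print(A["N_NN"])  # transition probabilities out of N_NN
print(B["V_VM"])  # emission probabilities of V_VM
```

With a large tag set, many tag pairs and word/tag pairs are never observed in a corpus of limited size, so their estimates are zero; this is the data sparseness problem that the reduced tag set is meant to ease.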


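$$(\hat{t}_1, \ldots, \hat{t}_n) = \arg\max_{t_1, \ldots, t_n} P(t_1, \ldots, t_n \mid W_1, \ldots, W_n)$$

Using Bayes' law, the above equation can be rewritten as:

$$P(t_1, \ldots, t_n \mid W_1, \ldots, W_n) = \frac{P(t_1, \ldots, t_n) \times P(W_1, \ldots, W_n \mid t_1, \ldots, t_n)}{P(W_1, \ldots, W_n)}$$

Since the denominator does not depend on the tag sequence, it can be dropped from
the maximization:

$$(\hat{t}_1, \ldots, \hat{t}_n) = \arg\max_{t_1, \ldots, t_n} P(t_1, \ldots, t_n) \times P(W_1, \ldots, W_n \mid t_1, \ldots, t_n)$$

Under the first order (bigram) assumption, each tag depends only on the previous
tag and each word depends only on its own tag, so that

$$(\hat{t}_1, \ldots, \hat{t}_n) = \arg\max_{t_1, \ldots, t_n} \prod_{i=1}^{n} \underbrace{P(t_i \mid t_{i-1})}_{\text{transition probability}} \times \underbrace{P(W_i \mid t_i)}_{\text{emission probability}}$$

where $P(t_1 \mid t_0)$ is taken to be the initial probability of tag $t_1$.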
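
The maximization over the product of transition and emission probabilities can be carried out with the Viterbi algorithm. The following is a minimal sketch and not the authors' implementation (which was written in C#): it works in log space, and it replaces any probability missing from the tables by a small `floor` value, a crude stand-in for smoothing that is an assumption of this sketch. The probability values in the usage example are invented.

```python
import math

def viterbi(words, tags, pi, A, B, floor=1e-6):
    """Return the most likely tag sequence for `words` under a bigram HMM.

    pi[t] is the initial probability of tag t, A[p][t] the transition
    probability from tag p to tag t, and B[t][w] the emission probability
    of word w under tag t. Missing entries fall back to `floor`.
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t]: log-probability of the best tag path ending in t at position i
    best = [{t: logp(pi, t) + logp(B.get(t, {}), words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + logp(A.get(p, {}), t)) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + logp(B.get(t, {}), words[i])
            back[i][t] = prev

    # Trace the best path backwards from the best final tag.
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Invented toy parameters over three tags, for illustration only.
tags = ["N_NN", "V_VM", "PSP"]
pi = {"N_NN": 0.8, "V_VM": 0.1, "PSP": 0.1}
A = {"N_NN": {"N_NN": 0.3, "V_VM": 0.4, "PSP": 0.3},
     "V_VM": {"N_NN": 0.5, "V_VM": 0.3, "PSP": 0.2},
     "PSP": {"N_NN": 0.6, "V_VM": 0.3, "PSP": 0.1}}
B = {"N_NN": {"Baludu": 0.5, "pataselaku": 0.5},
     "V_VM": {"bayalderadu": 1.0}}
print(viterbi(["Baludu", "pataselaku", "bayalderadu"], tags, pi, A, B))
# ['N_NN', 'N_NN', 'V_VM']
```

The work per word is proportional to the square of the number of tags, so moving from 630 tags to 36 also shrinks the search considerably.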


METHODOLOGY: 
The following flow shows the steps used to implement the new tag set in the HMM:

* Collection of raw corpus
* Annotation of the corpus, either manually or using an existing POS tagger
* Filtering out the unambiguous sentences, if tagged using the rule based POS tagger
* Creation of a map file that maps the old tags to the new tags
* Use of the filtered unambiguous sentences to generate emission and transition probabilities
* Use of the Viterbi algorithm to implement the HMM
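
The map file step can be sketched as a prefix lookup. This is a hedged illustration only: it assumes that the keys of the map file are prefixes of the old tags, as the "Old tag starts with" column of the mapped file suggests, and that the longest matching prefix wins, which is a choice made for this sketch. The `TAG_MAP` excerpt uses a few rows of the mapped file; the function names and the `default` fallback are hypothetical.

```python
# Excerpt of a map file as (old-tag prefix, new tag) pairs.
TAG_MAP = [
    ("VBMA", "V_VM"),
    ("PNP", "PR_PRP"),
    ("PTUN", "RP_NEG"),
    ("PTU", "RP"),
    ("CJ", "CC"),
]

def map_tag(old_tag, tag_map=TAG_MAP, default="RD_UNK"):
    """Map an old fine-grained tag to a new tag by its longest matching prefix."""
    matches = [(prefix, new) for prefix, new in tag_map if old_tag.startswith(prefix)]
    if not matches:
        return default  # no prefix matched: fall back to the 'unknown' tag
    return max(matches, key=lambda pair: len(pair[0]))[1]

def remap_sentence(tagged_sentence):
    """Rewrite every (word, old_tag) pair of a sentence with its new tag."""
    return [(word, map_tag(tag)) for word, tag in tagged_sentence]

print(map_tag("PTUN"))  # RP_NEG (longest prefix beats 'PTU')
print(map_tag("PTU"))   # RP
print(map_tag("XYZ"))   # RD_UNK
```

Rows that offer several candidate new tags for one prefix cannot be resolved by such a lookup alone, which is consistent with the mapping having been done partly by hand.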


Step 1:- Collection of raw corpus.
        We collected a large accurate corpus of nearly 200 pages containing
        nearly 8000 sentences and approximately 42,000 words. This corpus
        was collected from the internet. The following web sites were used
        for the collection of the corpus:


        * http://telugutribuneonline.com
        * www.teluguinfoline.com


Step 2:- Annotation of corpus.
        For the annotation we used an existing lexicon based morphological
        analyzer. This morphological analyzer contains more than one million
        Telugu words with their part of speech tags.


Step 3:- Mapping with new tags.
        The existing morphological analyzer used in step 2 contains tags from
        the tag set of 630 tags, whereas the new system has to use the tags
        proposed by TDIL, so the tags were mapped to reduce the 630 tags to
        36 tags. The tags of the annotated corpus developed in the above
        step were converted into the standard common tag set for Indian
        languages, as per the IIIT tag set guidelines. This mapping was done
        partly manually and partly by computer using a map file. The map
        file was developed manually with the help of a linguist.


International Education & Research Journal [IERJ]
Research Paper


Mapped File

S. No   Old tag starts with                          New tag
1       NN                                           N_NN/N_NST/N_NNP
2       PNP                                          PR_PRP
3       PNR                                          PR_PRF
4       PND                                          PR_PRC
5       PNI                                          PR_PRI
6       PNE                                          PR_PRL
7       PNN                                          PR_PRQ
8       AJI                                          JJ
9       CD                                           QT_QTC
10      OD                                           QT_QTO
11      PPU                                          PSP
12      AVI                                          RB
13      PPI                                          PSP
14      AVU                                          RB
15      VBP                                          V
16      CJ                                           CC
17      PTU                                          RP
18      PTV                                          RP
19      AJU                                          JJ
20      VBO                                          V
21      VBMA                                         V_VM
22      VBAX                                         V_VAUX
23      Unknown                                      RD_UNK
24      Comma, Dot, Question, Sentence,              RD_PUNC
        Exclamation, Colon, Semicolon,
        Opening Single Quote, Closing Single
        Quote, Opening Double Quote, Closing
        Double Quote
25      Opening Bracket, Opening Brace,              RD_SYM
        Closing Bracket, Closing Brace,
        Opening Parenthesis, Closing
        Parenthesis, Less Than, Greater Than
26      VBMAXPSXXTNE                                 V_VM_VNG
27      AJU                                          RP_INTF
28      PTUN                                         RP_NEG
29      AJIMSD                                       QT_QTF
30      No specific tag, but [illegible]             RD_RDF
        generally unknown
31      No specific tag, but words with a            RD_ECH
        hyphen in between them
32      VBMA[illegible]                              V_VM_VINF
33      VBMAXXXXXINDIAN                              V_VM_VNF


Step 4:- Development of emission and transition probability files.
        In order to find the transition and emission probabilities we
        developed an application in Visual Studio (C#.NET). The probability
        files were kept in txt format. These files were used for implementing
        the HMM using the Viterbi algorithm.


Sample transition file

Tag1/Tag2 pair    Probability
N_NN/V            0.190476
V/V_VM            0.005376
V_VM/RP           0.040698
[illegible]       [illegible]
PR_PRP/N_NN       0.01066
N_NN/PSP          0.470769
PSP/N_NN          0.028161
N_NN/V_VM         0.135289


Step 5:- The Viterbi algorithm was used to implement the HMM.


Experimental Evaluation:
The accuracy of a natural language product is generally measured in terms of
precision and recall. Precision is the percentage of disambiguated tags that are
correct, and recall is the percentage of tags that the system disambiguated
correctly rather than leaving them undisambiguated. If A is the number of
correctly disambiguated tags and B is the number of tags that were not
disambiguated by our system, then

        Recall = A / (A + B)

Similarly, if A is the number of correctly disambiguated tags and C is the number
of incorrectly disambiguated tags, then

        Precision = A / (A + C)

For evaluation of the proposed POS tagger, a corpus with texts from different
online resources, i.e. Telugu websites, was used. The outcome was manually
evaluated by a linguistic expert to mark the correctly and incorrectly
disambiguated tags. The results obtained are given in Table 1. The precision and
recall values are given in Table 2.


Table 1: Experimental Result

Corpus          Total no.   No. of unknown words         No. of known   No. of correctly disambiguated tags
                of words    (not tagged by the system)   words          Existing HMM      Proposed
                                                                        based system      system
Essay           6894        442                          6420           6132              6409
News            5387        384                          4853           4803              4783
Short stories   7639        142                          [?]304         8146              8231
Novel           2705        322                          3754           3765              3268
Book Chapter    3876        74                           3765           3754              3174


Table 2: Precision and Recall of HMM based and Proposed system

Corpus type     Existing HMM based system                   Proposed system
                A      B      C    Precision   Recall       A      B    C    Precision   Recall
Essay           6548   846    0    100%        84.5%        6528   0    71   98.8%       100%
News            4723   537    0    100%        85%          4698   0    70   98.4%       100%
Short stories   9573   2286   0    100%        85.5%        9297   0    47   98.2%       100%
Novel           3287   345    0    100%        86%          3387   0    40   99.3%       100%
Book Chapter    3864   378    0    100%        86.5%        3184   0    34   98.7%       100%


CONCLUSION:
In this work, a reduced tag set has been implemented in a Hidden Markov Model
(HMM) based parts of speech tagger in an effort to improve the accuracy of the
HMM based Telugu POS tagger. The tag set has been reduced from more than
630 tags to 36 tags. We observed a significant improvement in the accuracy of
tagging. Our proposed tagger shows an accuracy of 95-98% whereas the existing
HMM based POS tagger was reported to give an accuracy of 84-88%. This
significant improvement is due to the reduction in the tag set from more than 630
tags to 36 tags. The main problem with a large tag set is data sparseness. The
reduction in the number of tags reduces data sparseness and hence improves
the accuracy.
Future scope:

The use of a hybrid model, i.e. a combination of more than one statistical model,
can be explored further. The same approach can also be implemented for other
Indian languages such as Hindi, Bengali, and Punjabi. Applying different
approaches to find the POS tag of unknown words is yet to be done for further
improvement.